What We Will Be Doing

This notebook accompanies the Medium article on Bokeh and walks through the features mentioned there. Specifically, we will use data on NYC apartments to look at the relationship between price and square footage, while showing off some cool features of Bokeh. To get the data, just go to this GitHub repo.

Libraries

In [1]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Range1d, HoverTool
from bokeh.embed import components
from bokeh.io import curdoc
from bokeh.themes import Theme
import pandas as pd
import numpy as np

This little function call makes our plots render cleanly inside Jupyter - adios Matplotlib!

In [2]:
output_notebook()
Loading BokehJS ...

Our Data

This data comes from a little pipeline I built, which is outlined in this Medium article on AWS Lambda Pipelines. Basically, our data covers New York City apartments and was scraped from Craigslist over June and July 2019. The Craigslist data has a few enrichments that bring in data from Mapquest and Walk Scores, but it should be pretty intuitive to understand.

Import our data

Read in our data and convert the datetime column into a plain date column.

In [3]:
df = pd.read_csv('data/nyc_apartments.csv')
# infer_datetime_format is deprecated in pandas 2.x, where format inference is the default
df['date'] = pd.to_datetime(df['datetime']).dt.date
df.head()
Out[3]:
id address area bedrooms bikeScore datetime distanceToNearestIntersection has_image has_map name ... month dow day hour advertises_no_fee is_repost sideOfStreetEncoded postalCodeChopped neighborhood date
0 6911917730 320 Chauncey St NaN 3.0 64.0 2019-06-21 14:34:00 0.000000 1 1 you’re in good hands...t e x t us to view bk’s... ... 6 4 21 14 1 0 1.0 11233.0 Southeast Bronx 2019-06-21
1 6917210186 530 W 143rd St 800.0 1.0 88.0 2019-06-21 14:33:00 203.483553 1 1 spacious 1br penthouse with deck!! near col un... ... 6 4 21 14 0 1 0.0 10031.0 Upper West Side 2019-06-21
2 6914527887 410 Pulaski St NaN 3.0 79.0 2019-06-21 14:33:00 0.013114 1 1 this is the one you’ve been looking for… call ... ... 6 4 21 14 0 0 1.0 11221.0 Sunset Park 2019-06-21
3 6914529944 410 Pulaski St NaN 3.0 79.0 2019-06-21 14:33:00 0.013114 1 1 simplify your search with us**pro team w/ big ... ... 6 4 21 14 1 0 1.0 11221.0 Sunset Park 2019-06-21
4 6917173545 4754 Center Blvd 653.0 1.0 81.0 2019-06-21 14:33:00 61.301497 1 1 sunny 1br in long island city. brand new renov... ... 6 4 21 14 1 1 0.0 11109.0 Queens 2019-06-21

5 rows × 32 columns
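
To see what the date conversion in the cell above does, here is a minimal sketch on a made-up two-row frame. The `toy` frame and its timestamps are assumptions for illustration only; they are not rows from the real dataset.

```python
import datetime

import pandas as pd

# Hypothetical stand-in for the 'datetime' column of the apartments data
toy = pd.DataFrame({'datetime': ['2019-06-21 14:34:00', '2019-07-01 09:05:00']})

# Parse the strings into timestamps, then keep only the calendar date
toy['date'] = pd.to_datetime(toy['datetime']).dt.date

print(toy['date'].tolist())
```

Each entry in the new column is a plain `datetime.date`, which is handy for grouping listings by day.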

Column Data Source

Bokeh has something called a "ColumnDataSource", which will quickly become your best friend. You can read about it in the docs, but at a high level it converts your Pandas dataframe into something Bokeh can easily use. You can see how we use this weapon of mass plotting in the charts below, but the general process is:

  • Get your data in the proper format with pandas
  • Make this properly formatted dataframe a ColumnDataSource
  • Use this ColumnDataSource when you call your plotting function

We can very easily create a ColumnDataSource with any dataframe.

In [4]:
source = ColumnDataSource(df)
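
As a rough illustration of what a ColumnDataSource holds, the sketch below wraps a tiny made-up dataframe (the `toy` frame and its values are assumptions, not the apartment data) and inspects the result. Each dataframe column becomes a named array, and the dataframe index comes along as an extra column.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Hypothetical three-row frame standing in for the apartments data
toy = pd.DataFrame({'area': [650, 800, 1100], 'price': [2400, 3100, 4500]})

toy_source = ColumnDataSource(toy)

# Every dataframe column is available by name, plus the index
print(toy_source.column_names)

# The underlying data is a dict mapping column name -> array of values
print(list(toy_source.data['price']))
```

This is why the plotting calls later on can refer to columns simply as strings such as 'area' and 'price'.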

Let's Create Some Visualizations

Price vs. Square Footage

In [5]:
# Keep only the rows where area is present
df_has_area = df[df['area'].notnull()].copy()

# Keep points below the 95th percentile of area to drop extreme outliers
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]

# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)

# Create our figure
p = figure(title="Price vs. Square Footage")

# Plot our data
p.scatter(x='area', y='price', line_color='#000000', source=source, size=10)

show(p)
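
The two filtering steps in the cell above (drop missing areas, then cut at the 95th percentile) can be checked on a small made-up frame. The values below are assumptions chosen so that one obvious outlier gets trimmed.

```python
import numpy as np
import pandas as pd

# Hypothetical square footages: one missing value and one extreme outlier
toy = pd.DataFrame({'area': [500, 650, 800, 950, 1100, np.nan, 20000]})

# Step 1: keep only rows that have an area
has_area = toy[toy['area'].notnull()]

# Step 2: compute the 95th percentile and keep rows strictly below it
cutoff = np.percentile(has_area['area'].values, 95)
trimmed = has_area[has_area['area'] < cutoff]

print(cutoff)
print(trimmed['area'].tolist())
```

Trimming the top few percent this way keeps a handful of huge listings from squashing the rest of the scatter plot into one corner.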

Let's make it more beautiful

The output isn't bad per se, but let's make it visually more appealing:

  • Fill the whole width
  • Bigger title
  • Get rid of those ugly toolbar icons
  • Remove gridlines
  • Change the font size of our axes

This can all be done with a few small code changes:

In [6]:
# Keep only the rows where area is present
df_has_area = df[df['area'].notnull()].copy()

# Keep points below the 95th percentile of area to drop extreme outliers
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]

# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)

# Create our figure, now with the sizing mode feature
p = figure(title="Price vs. Square Footage", sizing_mode="stretch_width", tools=[], toolbar_location=None)

# Plot our data
p.scatter(x='area', y='price', line_color='#000000', source=source, size=10)

# Grid lines and font size
p.xgrid.grid_line_color, p.ygrid.grid_line_color = None, None
p.xaxis.major_label_text_font_size, p.yaxis.major_label_text_font_size = '11pt', '11pt'
p.title.text_font_size='14pt'

show(p)

Themes - Isn't it annoying to make these style changes on every chart?

Yes! What if I always want my title to be size 14? Or I always want there to be no grid? Having to type these in for every chart gets old quickly. Introducing themes!

In [7]:
curdoc().theme = Theme(json={'attrs': {

# apply defaults to Figure properties
'Figure': {
    'toolbar_location': None,
    'outline_line_color': None,
    'min_border_right': 10,
    'sizing_mode': 'stretch_width'
},

'Grid': {
    'grid_line_color': None,
},
'Title': {
    'text_font_size': '14pt'
},

# apply defaults to Axis properties
'Axis': {
    'minor_tick_out': None,
    'minor_tick_in': None,
    'major_label_text_font_size': '11pt',
    'axis_label_text_font_size': '13pt',
    'axis_label_text_font': 'Work Sans'
},
# apply defaults to Legend properties
'Legend': {
    'background_fill_alpha': 0.8,
}}})
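
If you don't want to write a theme from scratch, Bokeh also ships a few ready-made ones. The sketch below assumes a Bokeh version that exposes the `built_in_themes` dictionary in `bokeh.themes`. It only looks the themes up and does not apply one, so the custom theme set above stays active.

```python
from bokeh.themes import Theme, built_in_themes

# Names of the themes bundled with Bokeh
print(sorted(built_in_themes))

# Each entry is a ready-made Theme object
dark = built_in_themes['dark_minimal']
print(isinstance(dark, Theme))

# To use it instead of a custom theme, you would assign it the same way:
# curdoc().theme = dark
```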

Now let's rerun the code from our original plot. The theme takes care of all the styling for us automatically.

Price vs. Square Footage

Again, we no longer specify any of the styling attributes manually.

In [8]:
# Keep only the rows where area is present
df_has_area = df[df['area'].notnull()].copy()

# Keep points below the 95th percentile of area to drop extreme outliers
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]

# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)

# Create our figure
p = figure(title="Price vs. Square Footage")

# Plot our data
p.scatter(x='area', y='price', line_color='#000000', source=source, size=10)

show(p)

Adding a color column

One nice feature of Bokeh is that you can leverage Pandas to create columns and then use them in your plot. In this example we will map each discrete value for bedrooms to a color and then use that to color our plot.

In [13]:
# Keep only the rows where area is present
df_has_area = df[df['area'].notnull()].copy()

# Distinct bedroom counts that need a color (run this line on its own to see them)
df_has_area['bedrooms'].unique()

# Create color column based on the bedroom number; missing or unmapped counts fall back to gray
bedroomMapping = {0: 'green', 1: 'red', 2: 'blue', 3: 'yellow', 4: 'purple', 5: 'black', 6: 'teal'}
df_has_area['color'] = df_has_area['bedrooms'].map(bedroomMapping).fillna('gray')

# Keep points below the 95th percentile of area to drop extreme outliers
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]

# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)

# Create our figure
p = figure(title="Price vs. Square Footage")

# Plot our data
p.scatter(x='area', y='price', fill_color='color', line_color='#000000', source=source, size=10)

show(p)
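
To make the mapping step concrete, here is a small sketch on made-up bedroom counts (the values and the shortened mapping are assumptions for illustration). It also shows one way to guarantee a fallback color, using `fillna`, for counts that are missing or not in the dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts, including a missing value and an unmapped one
bedrooms = pd.Series([0.0, 1.0, 2.0, np.nan, 7.0])

# Shortened version of the mapping used in the notebook
toy_mapping = {0: 'green', 1: 'red', 2: 'blue'}

# map() looks each value up in the dict; values without a match become NaN,
# and fillna() then replaces those with a default color
colors = bedrooms.map(toy_mapping).fillna('gray')

print(colors.tolist())
```

Because the result is just another dataframe column, it travels into the ColumnDataSource with everything else and can be referenced by name in the plotting call.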

Adding Interactivity

Let's add some of those cool tooltips like Tableau has!

In [10]:
# Keep only the rows where area is present
df_has_area = df[df['area'].notnull()].copy()

# Distinct bedroom counts that need a color (run this line on its own to see them)
df_has_area['bedrooms'].unique()

# Create color column based on the bedroom number; missing or unmapped counts fall back to gray
bedroomMapping = {0: 'green', 1: 'red', 2: 'blue', 3: 'yellow', 4: 'purple', 5: 'black', 6: 'teal'}
df_has_area['color'] = df_has_area['bedrooms'].map(bedroomMapping).fillna('gray')

# Keep points below the 95th percentile of area to drop extreme outliers
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]

# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)

# Create our figure
p = figure(title="Price vs. Square Footage")

# Plot our data
p.scatter(x='area', y='price', fill_color='color', line_color='#000000', source=source, size=10)

# Create our tooltip
tooltips = """
<div style="width:500px;">
    <h5 style="color:#0015bc; display:inline; font-size:1.2em">Craigslist URL: </h5>
    <h5 style="color:#000000; font-size: 1.2em; display:inline;">@url</h5>
</div>
<div class="tooltip-section">
    <h5 style="color:#0015bc; display:inline; font-size:1.2em">Price ($): </h5>
    <h5 style="color:#000000; font-size: 1.2em; display:inline;">$@price{0,0}</h5>
</div>
<div class="tooltip-section">
    <h5 style="color:#0015bc; display:inline; font-size:1.2em">Square Footage: </h5>
    <h5 style="color:#000000; font-size: 1.2em; display:inline;">@area{0,0}</h5>
</div>
"""

p.add_tools(HoverTool(tooltips=tooltips))

show(p)
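
The HTML template gives full control over styling, but for quick exploration HoverTool also accepts a plain list of (label, value) tuples. The sketch below uses a tiny made-up data source (an assumption, not the apartment data) and the same `@column{format}` fields as the template above.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Hypothetical two-listing data source for illustration
toy_source = ColumnDataSource({'area': [650, 1100], 'price': [2400, 4500]})

toy_plot = figure(title="Price vs. Square Footage")
toy_plot.scatter(x='area', y='price', source=toy_source, size=10)

# Each tuple is (label shown in the tooltip, field to display)
simple_hover = HoverTool(tooltips=[
    ("Price ($)", "@price{0,0}"),
    ("Square Footage", "@area{0,0}"),
])
toy_plot.add_tools(simple_hover)

# Call show(toy_plot) in a notebook to see the tooltips on hover
```

The tuple form is quicker to write; the HTML form is worth the extra effort once you care about colors, fonts, and layout inside the tooltip.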

That's It

That is just the beginning of Bokeh. If you use Pandas a lot, I highly encourage you to keep learning Bokeh. It has served me really well for creating visualizations, especially when sharing them with colleagues.